Abstract

The paper provides results regarding the computational complexity of hybrid system identification.
More precisely, we focus on the estimation of piecewise affine (PWA) maps from
input-output data and analyze the complexity of computing a global minimizer of the error. 
Previous work showed that a global solution could be obtained for continuous PWA maps with 
a worst-case complexity exponential in the number of data. In this paper, we show how global 
optimality can be reached for a slightly more general class of possibly discontinuous PWA 
maps with a complexity only polynomial in the number of data, however with an exponential 
complexity with respect to the data dimension. This result is obtained via an analysis of the 
intrinsic classification subproblem of associating the data points to the different modes. In 
addition, we prove that the problem is NP-hard, and thus that the exponential complexity in 
the dimension is a natural expectation for any exact algorithm. 


1 Introduction 

Hybrid system identification aims at estimating a model of a system switching between different 
operating modes from input-output data. More precisely, most of the literature considers autoregressive
with external input (ARX) models to cast the problem as a regression one [1]. Then,
two cases can be distinguished: switching regression, where the system arbitrarily switches from 
one mode to another, and piecewise affine (PWA) regression, where the switches depend on the 
regressors. A number of methods with satisfactory performance in practice are now available for 
these problems [5]. However, compared with linear system identification, a major weakness of these 
methods is their lack of guarantees. 

For the particular case of noiseless data, the algebraic method [3] provides a solution to switching 
regression with a small number of modes. However, the quality of the estimates quickly degrades 
with the increase of the noise level. A few sparsity-based methods may also offer guarantees in the
noiseless case, but these are subject to a condition on both the data and the sought solution. In the
presence of noise, most methods consider the minimization of the error of the model over the data
[1]. While this does not necessarily yield the best predictive model (due to issues like identifiability,
persistence of excitation and access to a limited amount of data), obtaining statistical guarantees 
with such an approach has a long history in statistics and system identification [5]. However, such 
results are not available for hybrid systems. This is probably due to the fact that minimizing the 
error of a hybrid model is a difficult nonconvex optimization problem involving the simultaneous 
classification of the data points into modes and the regression of a submodel for each mode. Thus, 
theoretical guarantees could only be obtained under the rather strong assumption that this problem 
has been solved to global optimality and most of the literature [71 m m [ini [III [ID focuses on this 
issue with heuristics of various degrees of accuracy and computational efficiency. Many recent 
works n (IllTl Hi [HIS] try to avoid local minima by considering convex formulations, but 
these only yield optimality with respect to a relaxation of the original problem. Global optimality 
in the presence of noise was only reached in [17] for a particular class of continuous PWA maps
known as hinging-hyperplanes by reformulating the problem as a mixed-integer program solved by 
branch-and-bound techniques. However, such optimization problems are NP-hard [18] and branch-and-bound
algorithms have a worst-case complexity exponential in the number of integer variables,
here proportional to the number of data and the number of modes. 

Inspired by related clustering problems, such as the minimization of the sum of squared distances 
between points and their group centers, we could minimize the hybrid model error by enumerating 
all possible classifications of the points. But the number of classifications is exponential in the 
number of data. Conversely, the other approach enumerating a sample of values for the real 
variables of the problem is exponential in the dimension and can only offer an approximate solution. 

Overall, the literature does not provide a method that can guarantee both the optimality and 
the computability of a global minimizer of the error, while the computational complexity of this 
problem remains unknown and cannot be deduced from the NP-hardness of classical clustering 
problems [19] (see [20] for an introduction to computational complexity and its relevance to
control theory). 

Contribution The paper provides two results regarding the computational complexity of PWA 
regression, and more precisely for the problem of finding a global minimizer of the error of a PWA 
model, formalized in Sect. 2. First, we show in Sect. 3 that the problem is NP-hard. Then, we show
in Sect. 4 that, for any fixed dimension of the data, an exact solution can be computed in time
polynomial in the number of data via an enumeration of all possible classifications. To obtain this 
result and avoid the exponential growth of the number of classifications with the number of data, 
we show that, in PWA regression, the classification of the data points is highly constrained and 
the number of classifications to test can be limited. The price to pay for this gain is an exponential 
complexity with respect to the data dimension and the number of modes. Future work is outlined 
in Sect. 5.

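Notations We use the indicator function $\mathbf{1}_E$ of an event $E$ that is 1 if the event occurs and 0
otherwise. We define $\mathrm{sign}(u) = 1$ if $u > 0$ and $-1$ otherwise. Given a set of labels $\mathcal{Q} \subset \mathbb{Z}$ and a set
of $N$ points, a labeling of these points is any $q \in \mathcal{Q}^N$. We use $j = \arg\max_{k \in \mathcal{Q}} u(k)$ as a shorthand
for $j = \min\{l \in \arg\max_{k \in \mathcal{Q}} u(k)\}$. Given two sets, $\mathcal{X}$ and $\mathcal{Y}$, $\mathcal{Y}^{\mathcal{X}}$ is the set of functions from $\mathcal{X}$
to $\mathcal{Y}$.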

2 Problem formulation 

As in most works, we concentrate on discrete-time PWARX system identification considered as a
PWA regression problem with regression vectors $x_i = [\dots, y_{i-n_y}, u_i, \dots]^T \in \mathcal{X}$ built
from past inputs $u_{i-k}$ and outputs $y_{i-k}$. Since we are interested in computational complexity
results, we restrain the data to rational, digitally representable, values and set $\mathcal{X} \subseteq \mathbb{Q}^d$. The
outputs are assumed to be generated by a PWA system $f$ as $y_i = f(x_i) + v_i$, where $v_i$ is a noise
term. More precisely, PWA models can be expressed via a set of $n$ affine submodels with parameter vectors $w_j \in \mathbb{Q}^{d+1}$ and a function
$h : \mathcal{X} \to \mathcal{Q} = \{1, \dots, n\}$ determining the active submodel: $f(x) = w_{h(x)}^T \bar{x}$, where $\bar{x} = [x^T, 1]^T$.

We call the function h a classifier as it classifies the data points in the different modes. Typically, 
PWA systems are defined with h implementing a polyhedral partition of X, with modes possibly 
spanning unions of polyhedra. However, in most of the literature on PWA system identification 
[n 13 in HU, h is estimated within the family of linear classifiers 

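$$\mathcal{H} = \{h \in \mathcal{Q}^{\mathcal{X}} : h(x) = \arg\max_{k \in \mathcal{Q}} \boldsymbol{h}_k^T x + b_k,\ \boldsymbol{h}_k \in \mathbb{Q}^d,\ b_k \in \mathbb{Q}\}, \tag{1}$$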

based on a set of $n$ linear functions and for which a mode spanning a union of polyhedra must
be modeled as several modes with similar affine submodels. For PWA maps with $n = 2$ modes,
$h$ is a binary classifier for which it is common to consider its output in $\mathcal{Q} = \{-1, +1\}$ instead of
$\{1, 2\}$. Such a binary classifier can be obtained by taking the sign of a real-valued function. If this
function is linear (or affine), then we obtain a linear classifier, which is equivalent to a separating
hyperplane dividing the input space $\mathcal{X}$ in two half-spaces. In this case, the function class $\mathcal{H}$ can
be defined as

$$\mathcal{H} = \{h \in \mathcal{Q}^{\mathcal{X}} : h(x) = \mathrm{sign}(\boldsymbol{h}^T x + b),\ \boldsymbol{h} \in \mathbb{Q}^d,\ b \in \mathbb{Q}\} \tag{2}$$

with a single set of parameters $(\boldsymbol{h}, b)$ corresponding to the normal to the hyperplane and the
offset from the origin. An equivalence with the multi-class formulation in (1) is obtained by using
$\boldsymbol{h} = \boldsymbol{h}_1 - \boldsymbol{h}_2$ and $b = b_1 - b_2$.

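In this paper, we consider the common estimation approach of minimizing the error on $N$ data
pairs $(x_i, y_i) \in \mathcal{X} \times \mathbb{Q}$, measured pointwise by a loss function $\ell : \mathbb{Q} \to \mathbb{Q}^+$ as

$$\ell(y_i - f(x_i)) = \sum_{j \in \mathcal{Q}} \mathbf{1}_{h(x_i) = j}\, \ell(y_i - w_j^T \bar{x}_i).$$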

More precisely, we focus on well-posed instances of the problem where N is significantly larger 
than the dimension d and the number of modes n is given. Indeed, with free n the problem is 
ill-posed as the solution is only defined up to a trade-off between the number of modes and the 
model accuracy. For the converse well-posed approach that minimizes n for a given error bound, 
a complexity analysis can be found in [21]. Under these assumptions, the problem is as follows.

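Problem 1 (Error-minimizing PWA regression). Given a data set $\{(x_i, y_i)\}_{i=1}^N \in (\mathcal{X} \times \mathbb{Q})^N$ with
$\mathcal{X} \subseteq \mathbb{Q}^d$ and an integer $n \in [2, N/(d+1)]$, find a global solution to

$$\min_{w \in \mathbb{Q}^{n(d+1)},\ h \in \mathcal{H}}\ \frac{1}{N} \sum_{i=1}^N \sum_{j \in \mathcal{Q}} \mathbf{1}_{h(x_i) = j}\, \ell(y_i - w_j^T \bar{x}_i), \tag{3}$$

where $w = (w_j)_{j \in \mathcal{Q}}$ is the concatenation of all parameter vectors and $\mathcal{H} \subset \mathcal{Q}^{\mathcal{X}}$ is the set of
$n$-category linear classifiers as in (1) or (2).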
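
To make the objective of Problem 1 concrete, the following minimal sketch evaluates the cost for a given pair $(w, h)$. It assumes a squared loss by default and indexes the modes from 0 rather than from 1; the names `pwa_cost` and `linear_classifier` are hypothetical helpers introduced here for illustration only.

```python
import numpy as np

def pwa_cost(X, y, W, classify, loss=lambda e: e ** 2):
    """Average loss (1/N) * sum_i loss(y_i - w_{h(x_i)}^T [x_i; 1])."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    W = np.asarray(W, dtype=float)                 # one row of d+1 parameters per mode
    Xbar = np.hstack([X, np.ones((len(X), 1))])    # extended regressors [x^T, 1]^T
    modes = np.array([classify(x) for x in X])     # h(x_i), assumed in {0, ..., n-1}
    preds = np.einsum("ij,ij->i", Xbar, W[modes])  # w_{h(x_i)}^T xbar_i for each i
    return float(np.mean(loss(y - preds)))

def linear_classifier(H, b):
    """Classifier of the form (1): argmax_k h_k^T x + b_k, ties to the smallest index."""
    H, b = np.asarray(H, dtype=float), np.asarray(b, dtype=float)
    return lambda x: int(np.argmax(H @ x + b))

# Example: y = |x| is a PWA map with two modes separated at x = 0.
h = linear_classifier(H=[[-1.0], [1.0]], b=[0.0, 0.0])
cost = pwa_cost([[-2], [-1], [1], [2]], [2, 1, 1, 2], [[-1, 0], [1, 0]], h)
```

Evaluating the cost is cheap; the difficulty analyzed in the remainder lies in the joint search over $w$ and $h$.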

The following analyzes the time complexity of Problem 1 under the classical model of computation
known as a Turing machine [18]. The time complexity of a problem is the lowest time
complexity of an algorithm solving any instance of that problem, where the time complexity of
an algorithm is the maximal number of steps occurring in the computation of the corresponding
Turing machine program. The loss function $\ell$ is assumed to be computable in polynomial time
throughout the paper.


3 NP-hardness 


This section contains the proof of the following NP-hardness result, where an NP-hard problem is
one that is at least as hard as any problem from the class NP of nondeterministic polynomial time
decision problems [18] (NP is the class of all decision problems for which a solution can be certified
in polynomial time).

Theorem 1. With a loss function $\ell$ such that $\ell(e) = 0 \Leftrightarrow e = 0$, Problem 1 is NP-hard.

The proof uses a reduction from the partition problem, known to be NP-complete [18], i.e., a
problem that is both NP-hard and in NP.

Problem 2 (Partition). Given a multiset (a set with possibly multiple instances of its elements)
of $d$ positive integers, $S = \{s_1, \dots, s_d\}$, decide whether there is a multisubset $S_1 \subset S$ such that

$$\sum_{s_i \in S_1} s_i = \sum_{s_i \in S \setminus S_1} s_i.$$

More precisely, we will reduce Problem 2 to the decision form of Problem 1.
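
For readers less familiar with Problem 2, the sketch below decides small Partition instances by tracking reachable subset sums. This is the standard pseudo-polynomial dynamic programming approach, not a method from this paper; its running time grows with the magnitude of the integers, so it does not contradict NP-completeness. The function name `has_partition` is hypothetical.

```python
def has_partition(S):
    """Return True if the multiset S of positive integers splits into two equal-sum parts."""
    total = sum(S)
    if total % 2:                      # an odd total can never be split evenly
        return False
    half = total // 2
    reachable = {0}                    # subset sums reachable with the elements seen so far
    for s in S:
        reachable |= {r + s for r in reachable if r + s <= half}
    return half in reachable
```

For instance, `has_partition([3, 1, 1, 2, 2, 1])` holds because 3 + 2 = 1 + 1 + 2 + 1, whereas `[1, 2, 5]` admits no valid partition.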

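Problem 3 (Decision form of PWA regression). Given a data set $\{(x_i, y_i)\}_{i=1}^N \in (\mathcal{X} \times \mathbb{Q})^N$, an
integer $n \in [2, N/(d+1)]$ and a threshold $\epsilon \ge 0$, decide whether there is a pair $(w, h) \in \mathbb{Q}^{n(d+1)} \times \mathcal{H}$
such that

$$\frac{1}{N} \sum_{i=1}^N \sum_{j \in \mathcal{Q}} \mathbf{1}_{h(x_i) = j}\, \ell(y_i - w_j^T \bar{x}_i) \le \epsilon, \tag{4}$$

where $\mathcal{H}$ is the set of linear classifiers as in (1) or (2) and the loss function $\ell$ is such that $\ell(e) = 0 \Leftrightarrow e = 0$.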
Proposition 1. Problem 3 is NP-complete.

Proof. Since, given a candidate pair $(w, h)$, the condition (4) can be verified in polynomial time,
Problem 3 is in NP. Then, the proof of its NP-completeness proceeds by showing that the Partition
Problem 2 has an affirmative answer if and only if Problem 3 with $\epsilon = 0$ has an affirmative answer.
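Given an instance of Problem 2, let $N = 2d + 3$, $n = 2$, $\mathcal{Q} = \{-1, 1\}$ and build a data set with

$$(x_i, y_i) = \begin{cases} (s_i e_i,\ s_i), & \text{if } 1 \le i \le d \\ (-s_{i-d} e_{i-d},\ s_{i-d}), & \text{if } d < i \le 2d \\ (s,\ 0), & \text{if } i = 2d + 1 \\ (-s,\ 0), & \text{if } i = 2d + 2 \\ (0,\ 0), & \text{if } i = 2d + 3, \end{cases}$$

where $e_k$ is the $k$th unit vector of the canonical basis for $\mathbb{Q}^d$ and $s = \sum_{k=1}^d s_k e_k$. If Problem 2 has
an affirmative answer, then, using the notations of (2), we can set

$$w_1 = \sum_{k \in I_1} \bar{e}_k - \sum_{k \in I_{-1}} \bar{e}_k, \quad w_{-1} = -w_1, \quad \boldsymbol{h} = \sum_{k \in I_1} e_k - \sum_{k \in I_{-1}} e_k, \quad b = 0,$$

where $\bar{e}_k = [e_k^T, 0]^T$, $I_1$ is the set of indexes of the elements of $S$ in $S_1$ and $I_{-1} = \{1, \dots, d\} \setminus I_1$.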
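This gives

$$w_1^T \bar{x}_i = \begin{cases} s_i = y_i, & \text{if } i \le d \text{ and } i \in I_1 \\ -s_i = -y_i, & \text{if } i \le d \text{ and } i \in I_{-1} \\ s_{i-d} = y_i, & \text{if } d < i \le 2d \text{ and } i - d \in I_{-1} \\ -s_{i-d} = -y_i, & \text{if } d < i \le 2d \text{ and } i - d \in I_1 \\ \sum_{k \in I_1} s_k - \sum_{k \in I_{-1}} s_k = 0 = y_i, & \text{if } i = 2d + 1 \\ \sum_{k \in I_{-1}} s_k - \sum_{k \in I_1} s_k = 0 = y_i, & \text{if } i = 2d + 2 \\ 0 = y_i, & \text{if } i = 2d + 3 \end{cases}$$

and we can similarly show that

$$w_{-1}^T \bar{x}_i = y_i, \quad \text{if } i \in I_{-1} \vee i - d \in I_1 \vee i > 2d,$$

while $\boldsymbol{h}^T x_i$ is positive if $i \in I_1 \vee i - d \in I_{-1}$ and negative if $i \in I_{-1} \vee i - d \in I_1$. Therefore, for
all points, $w_{h(x_i)}^T \bar{x}_i = y_i$, $i = 1, \dots, 2d + 3$, and the cost function of Problem 1 is zero, yielding an
affirmative answer for Problem 3.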

It remains to prove that if (4) holds with $\epsilon = 0$, then Problem 2 has an affirmative answer. To
see this, note that due to $\ell$ being positive, a zero cost implies a zero loss for all data points. Thus,
by $\ell(e) = 0 \Leftrightarrow e = 0$, if (4) holds with $\epsilon = 0$,

$$w_{h(x_i)}^T \bar{x}_i = y_i, \quad i = 1, \dots, 2d + 3. \tag{5}$$

Also note that if $h(x_i) = h(x_{i+d}) = 1$ for some $i \le d$, we have $s_i w_{1,i} + w_{1,d+1} = -s_i w_{1,i} + w_{1,d+1} = s_i$.
This is only possible if $s_i = 0$, which is not the case (otherwise we can simply remove $s_i$ without
influencing the partition problem), or if $w_{1,d+1} = s_i$. The latter is impossible if $h(x_i) = h(x_{i+d})$
since $h$ is a linear classifier that must return the same category for all points on the line segment
between $x_i$ and $x_{i+d}$, which includes the origin $x_{2d+3} = 0$ and thus would imply by (5) that
$w_1^T \bar{x}_{2d+3} = w_{1,d+1} = y_{2d+3} = 0$. As a consequence, $h(x_i) = h(x_{i+d}) = 1$ cannot hold, and since
we can similarly show that $h(x_i) = h(x_{i+d}) = -1$ cannot hold, we have $h(x_i) \ne h(x_{i+d})$ for all
$i \le d$. Hence, (5) leads to

$$w_1^T \bar{x}_i = y_i \ \vee\ w_1^T \bar{x}_{d+i} = y_{d+i}, \quad i = 1, \dots, d \tag{6}$$

$$w_{-1}^T \bar{x}_i = y_i \ \vee\ w_{-1}^T \bar{x}_{d+i} = y_{d+i}, \quad i = 1, \dots, d. \tag{7}$$
Let $I_1 = \{i \in \{1, \dots, d\} : w_1^T \bar{x}_i = y_i\}$ and $I_{-1} = \{1, \dots, d\} \setminus I_1$. Then, if $h(x_{2d+3}) = +1$,
$w_{1,d+1} = 0$ and for all $i \le d$, $w_1^T \bar{x}_i = w_{1,i} s_i$. Therefore, for all $i \in I_1$, $w_{1,i} = 1$, while for all $i \in I_{-1}$,
(6) gives $w_1^T \bar{x}_{d+i} = y_{d+i}$, i.e., $-w_{1,i} s_i = s_i$ and $w_{1,i} = -1$. This leads to

$$w_1^T \bar{x}_{2d+1} = \sum_{i \in I_1} s_i - \sum_{i \in I_{-1}} s_i = -w_1^T \bar{x}_{2d+2}.$$

Thus, if $w_1^T \bar{x}_{2d+1} = y_{2d+1} = 0$ or $w_1^T \bar{x}_{2d+2} = y_{2d+2} = 0$, a valid partition in the sense of
Problem 2 is obtained with $S_1 = \{s_i\}_{i \in I_1}$. In addition, if $w_1^T \bar{x}_{2d+1} \ne 0$ and $w_1^T \bar{x}_{2d+2} \ne 0$,
then by (5), $w_{-1}^T \bar{x}_{2d+1} = w_{-1}^T \bar{x}_{2d+2} = 0$, which by construction implies that $w_{-1,d+1} = 0$. In
this case, we redefine $I_{-1} = \{i \in \{1, \dots, d\} : w_{-1}^T \bar{x}_i = y_i\}$ and $I_1 = \{1, \dots, d\} \setminus I_{-1}$ to obtain
$w_{-1,i} = 1$ for all $i \in I_{-1}$ and $w_{-1,i} = -1$ for all $i \in I_1$, resulting also in a valid partition by the
fact that $w_{-1}^T \bar{x}_{2d+2} = \sum_{i \in I_1} s_i - \sum_{i \in I_{-1}} s_i = 0$. Since a similar reasoning applies to the case
$h(x_{2d+3}) = -1$ by symmetry (substituting $w_{-1}$ for $w_1$), a zero cost, i.e., (4) with $\epsilon = 0$, always
implies an affirmative answer to Problem 2. □

Proof of Theorem 1. Since the decision form of Problem 1 with $\ell(e) = 0 \Leftrightarrow e = 0$, i.e., Problem 3,
is NP-complete, Problem 1 with such a loss function is NP-hard (solving Problem 1 also yields the
answer to Problem 3 and thus it is at least as hard as Problem 3). □


4 Polynomial complexity in the number of data 


We now state the result regarding the polynomial complexity of Problem 1 with respect to $N$ under
the following assumptions, the first of which holds almost surely for randomly drawn data points,
while the second one holds for instance for $\ell(e) = e^2$ with a linear time complexity $T(N) = O(N)$
[22].

Assumption 1. The points $\{x_i\}_{i=1}^N$ are in general position, i.e., no hyperplane of $\mathbb{Q}^d$ contains
more than $d$ points.

Assumption 2. Given $\{(x_i, y_i)\}_{i=1}^N \in (\mathcal{X} \times \mathbb{Q})^N$, the problem $\min_{w \in \mathbb{Q}^{d+1}} \sum_{i=1}^N \ell(y_i - w^T \bar{x}_i)$ has
a polynomial time complexity $T(N)$ for any fixed integer $d \ge 1$.

Theorem 2. For any fixed number of modes $n$ and dimension $d$, under Assumptions 1-2, the time
complexity of Problem 1 is no more than polynomial in the number of data $N$ and in the order of
$T(N)\, O(N^{dn(n-1)/2})$.

The proof of Theorem 2 relies on the existence of exact algorithms with complexity polynomial
in $N$ for the binary case ($n = 2$, Proposition 4) and the multi-class case ($n \ge 3$, Corollary 1).
These algorithms are based on a reduction of Problem 1 to a combinatorial search in two steps.
The first step reduces the problem to a classification one. Indeed, Problem 1 can be reformulated
as the search for the classifier $h$, since by fixing $h$, the optimal parameter vectors $\{w_j\}_{j \in \mathcal{Q}}$ can be
obtained by solving $n$ independent linear regression problems on the subsets of data resulting from
the classification by $h$, which, by Assumption 2, can be performed in the polynomial time $T(N)$.
This yields the following reformulation of the problem.

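Proposition 2. Problem 1 is equivalent to

$$\begin{aligned} \min_{h \in \mathcal{H}}\ & \frac{1}{N} \sum_{i=1}^N \sum_{j \in \mathcal{Q}} \mathbf{1}_{h(x_i) = j}\, \ell(y_i - w_j^T \bar{x}_i) \\ \text{s.t. } & \forall j \in \mathcal{Q},\ w_j \in \arg\min_{v \in \mathbb{Q}^{d+1}} \sum_{i=1}^N \mathbf{1}_{h(x_i) = j}\, \ell(y_i - v^T \bar{x}_i). \end{aligned} \tag{8}$$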


The second step reduces the estimation of $h$ to a combinatorial problem solved in $O(N^{dn(n-1)/2})$
operations, as detailed in Sect. 4.1-4.2 for $n = 2$ and in Sect. 4.3 for $n \ge 3$.


4.1 Finding the optimal classification

We reduce the complexity of searching for the classifier by considering all possible linear classifications
instead of all possible linear classifiers. In other words, we project the class $\mathcal{H}$ of classifiers
onto the set of points $S = \{x_i\}_{i=1}^N$ to reduce a continuous search to a combinatorial problem. This
is in line with the techniques used in statistical learning theory [23] for the different purpose of
computing error bounds for infinite function classes. Thus, we introduce definitions from this field.

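Definition 1 (Projection onto a set). The projection of a set of classifiers $\mathcal{H} \subset \mathcal{Q}^{\mathcal{X}}$ onto $S =
\{x_i\}_{i=1}^N$, denoted $\mathcal{H}_S$, is the set of all labelings of $S$ that can be produced by a classifier in $\mathcal{H}$:

$$\mathcal{H}_S = \{(h(x_1), \dots, h(x_N)) : h \in \mathcal{H}\} \subseteq \mathcal{Q}^N.$$

Definition 2 (Growth function). The growth function $\Pi_{\mathcal{H}}(N)$ of $\mathcal{H}$ at $N$ is the maximal number
of labelings of $N$ points that can be produced by classifiers from $\mathcal{H}$:

$$\Pi_{\mathcal{H}}(N) = \sup_{S \in \mathcal{X}^N} |\mathcal{H}_S|.$$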


We now focus on binary PWA maps and thus on binary classifiers with output in $\mathcal{Q} = \{-1, +1\}$.
For such classifiers, we obviously have $\Pi_{\mathcal{H}}(N) \le 2^N$ for all $N$. By further restricting $\mathcal{H}$ to affine
classifiers as in (2), results from statistical learning theory (see, e.g., [23]) provide the tighter bound
$\Pi_{\mathcal{H}}(N) \le \left(\frac{eN}{d+1}\right)^{d+1}$, which is polynomial in $N$ and thus promising from the viewpoint of global
optimization. However, its proof is not constructive and does not provide an explicit algorithm
for enumerating all the labelings. The following theorem, though leading to a looser bound on
the growth function, offers a constructive scheme to compute the projection $\mathcal{H}_S$, which is what we
need in order to test all the labelings in $\mathcal{H}_S$ for global optimization.


Theorem 3. The growth function of the class of binary affine classifiers $\mathcal{H}$ in (2) is
bounded for any $N > d$ by

$$\Pi_{\mathcal{H}}(N) \le 2^{d+1} \binom{N}{d} = O(N^d),$$

and, for any set $S$ of $N$ points in general position, an algorithm builds the projection $\mathcal{H}_S$ in $O(N^d)$
time.


The proof of Theorem 3 relies on the following proposition, which is illustrated by Fig. 1.

Proposition 3. For any binary affine classifier $h$ in $\mathcal{H}$ (2) and any finite set of $N > d$ points
$S = \{x_i\}_{i=1}^N$ in general position, there is a subset of points $S_h \subset S$ of cardinality $|S_h| = d$ and a
separating hyperplane of parameters $(\boldsymbol{h}_{S_h}, b_{S_h})$ passing through the points in $S_h$, i.e.,

$$\forall x \in S_h, \quad \boldsymbol{h}_{S_h}^T x + b_{S_h} = 0, \quad \text{with } \|\boldsymbol{h}_{S_h}\| = 1, \tag{9}$$

which yields the same classification of $S$ in the sense that

$$\forall x_i \in S \setminus S_h, \quad h(x_i) = \mathrm{sign}(\boldsymbol{h}_{S_h}^T x_i + b_{S_h}). \tag{10}$$


Proof sketch. For all classifiers $h$ with separating hyperplanes passing through $d$ points of $S$,
the statement is obvious. For the others passing through $p$ points with $0 \le p < d$, they can be
transformed to pass through additional points without changing the classification of the remaining
points. If $p = 0$, it suffices to translate the hyperplane to the closest point. If $0 < p < d$, the
hyperplane can be rotated with a plane of rotation that leaves unchanged the subspace spanned by
the $p$ points and a minimal angle yielding a rotated hyperplane passing through $p' > p$ points, where
$p' \le d$ by the general position assumption. Iterating this scheme until $p = d$ yields a hyperplane
passing through the points in $S_h$ of parameters $(\boldsymbol{h}_{S_h}, b_{S_h})$ satisfying (9) and (10). □
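
The hyperplane of Proposition 3 can be computed from its $d$ defining points. The sketch below is one possible way to do so, taking the unit normal as a null-space vector of the matrix of point differences obtained from a singular value decomposition. It works in floating point rather than with exact rationals, and the name `hyperplane_through` is hypothetical.

```python
import numpy as np

def hyperplane_through(points):
    """Return (h, b) with ||h|| = 1 such that h^T x + b = 0 for the d given points of R^d."""
    P = np.asarray(points, dtype=float)        # shape (d, d), one point per row
    if P.shape[0] == 1:                        # d = 1: the hyperplane is the point itself
        return np.ones(1), -float(P[0, 0])
    diffs = P[1:] - P[0]                       # d - 1 directions lying inside the hyperplane
    # With points in general position, diffs has rank d - 1, so the last
    # right-singular vector spans its null space and is the unit normal.
    h = np.linalg.svd(diffs)[2][-1]
    return h, -float(h @ P[0])

h, b = hyperplane_through([[0.0, 0.0], [1.0, 1.0]])
```

The orientation of the returned normal is arbitrary, which matches the fact that each subset $S_h$ gives rise to two hyperplanes of opposite orientations.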

We can now prove Theorem 3.

Figure 1: The hyperplane H (plain line) produces the same classification (into + and o) as the
hyperplane Hg (dashed line) obtained by a translation (dotted line) and a rotation of H such that 
it passes through exactly 2 points of S (*). 


Proof of Theorem 3. For any labeling $q$ in $\mathcal{H}_S$, there is a classifier $h \in \mathcal{H}$ that produces this
labeling. Applying Proposition 3 to $h$, we obtain another classifier $h_{S_h}$ of parameters $(\boldsymbol{h}_{S_h}, b_{S_h})$
that passes through the points in $S_h$ and that agrees with $h$ on $S \setminus S_h$. Let $q \in \{-1, +1\}^N$ be defined
by $q_i = h_{S_h}(x_i)$, $i = 1, \dots, N$. Then, we generate $2^d$ labelings by setting its entries $q_i$ with $x_i \in S_h$ to
all the $2^d$ combinations of signs (recall that $|S_h| = d$). By construction, there is no labeling of $S$ that
agrees with $q$ on $S \setminus S_h$ other than these $2^d$ labelings. Since this holds for any $q \in \mathcal{H}_S$, the cardinality
of $\mathcal{H}_S$ cannot be larger than $2^d$ times the number of hyperplanes passing through $d$ points of $S$.
Since each subset $S_h \subset S$ of cardinality $d$ gives rise to two hyperplanes of opposite orientations,
the number of such hyperplanes is $2\binom{N}{d}$ and we have $\Pi_{\mathcal{H}}(N) \le 2^{d+1}\binom{N}{d} = O(N^d)$.

In addition, there is an algorithm that enumerates all the subsets $S_h$ in $\binom{N}{d}$ iterations and builds
$\mathcal{H}_S$ by computing a hyperplane passing through the points^1 in $S_h$ and the corresponding $2^{d+1}$
labelings at each iteration. Since these inner computations can be performed in constant time with
respect to $N$, the algorithm has a time complexity in the order of $\binom{N}{d} = O(N^d)$. □
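
The constructive scheme of this proof can be sketched in a few lines. The code below enumerates every subset of $d$ points, computes a hyperplane through it, and generates the $2^{d+1}$ associated labelings (two orientations times $2^d$ sign patterns for the points on the hyperplane). It assumes points in general position and floating-point arithmetic; `binary_labelings` is a hypothetical name.

```python
from itertools import combinations, product
import numpy as np

def binary_labelings(X):
    """Set of labelings of the rows of X produced by the enumeration in the proof."""
    X = np.asarray(X, dtype=float)
    N, d = X.shape
    labelings = set()
    for idx in combinations(range(N), d):
        P = X[list(idx)]
        # Unit normal: null-space vector of the differences (trivial when d = 1).
        h = np.ones(1) if d == 1 else np.linalg.svd(P[1:] - P[0])[2][-1]
        b = -h @ P[0]
        base = np.where(X @ h + b > 0, 1, -1)       # labels of the points off the hyperplane
        for orientation in (1, -1):                 # two hyperplanes of opposite orientations
            for signs in product((1, -1), repeat=d):
                q = orientation * base
                q[list(idx)] = signs                # points on the hyperplane take any signs
                labelings.add(tuple(int(v) for v in q))
    return labelings

square = [[0, 0], [1, 0], [0, 1], [1, 1]]
n_found = len(binary_labelings(square))
```

On the four corners of the unit square, the enumeration finds 14 distinct labelings, i.e., all of the $2^4 = 16$ labelings except the two "exclusive-or" ones, well within the bound $2^{d+1}\binom{N}{d} = 48$ of Theorem 3.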


4.2 Global optimization of binary PWA models 

We can use the results above to reduce the complexity of Problem 1 in the binary case, considered in
the following in its equivalent form (8) from Proposition 2. First, note that the cost function in (8)
only depends on $h$, since all feasible values of $w$ for a given $h$ yield the same cost. Furthermore,
the cost does not depend on the exact value of $h$, but only on the resulting classification, i.e.,
on $h(x_i)$, $i = 1, \dots, N$. Thus, given a global solution $h^*$ to (8), any classifier $h$ producing the
same classification yields the same cost function value and hence is also a global solution. Thus,
the problem reduces to the search for the correct classification $q \in \mathcal{H}_S$, whose complexity is in
$O(\Pi_{\mathcal{H}}(N))$ and bounded by Theorem 3. In addition, for the purpose of binary PWA regression,
opposite labelings $q$ and $-q$ are equivalent and can be pruned from $\mathcal{H}_S$. This is due to the
symmetry of the cost function (8). Algorithm 1 provides a solution to Problem 1 for the binary
case while taking this symmetry into account.


^1 The normal $\boldsymbol{h}$ of a hyperplane $\{x : \boldsymbol{h}^T x + b = 0\}$ passing through $d$ points $\{x_i\}_{i=1}^d$ in $\mathbb{Q}^d$ can be computed as a
unit vector in the null space of $[x_2 - x_1, \dots, x_d - x_1]^T$, while the offset is given by $b = -\boldsymbol{h}^T x_i$ for any of the $x_i$'s.
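Algorithm 1 Exact solution to Problem 1 for n = 2

Input: A data set $\{(x_i, y_i)\}_{i=1}^N \subset (\mathbb{Q}^d \times \mathbb{Q})^N$.
Initialize $S \leftarrow \{x_i\}_{i=1}^N$ and $J^* \leftarrow +\infty$.
for all $S_h \subset S$ such that $|S_h| = d$ do
    Compute the parameters $(\boldsymbol{h}_{S_h}, b_{S_h})$ of a hyperplane passing through the points in $S_h$.
    Classify the data points: $S_1 = \{x_i \in S : \boldsymbol{h}_{S_h}^T x_i + b_{S_h} > 0\}$, $S_2 = \{x_i \in S : \boldsymbol{h}_{S_h}^T x_i + b_{S_h} < 0\}$.
    for all classification of $S_h$ into $S_h^1$ and $S_h^2$ do
        Set $w_j \in \arg\min_{v \in \mathbb{Q}^{d+1}} \sum_{x_i \in S_j \cup S_h^j} \ell(y_i - v^T \bar{x}_i)$, $j = 1, 2$,
        $J = \frac{1}{N} \sum_{j=1}^2 \sum_{x_i \in S_j \cup S_h^j} \ell(y_i - w_j^T \bar{x}_i)$,
        and update the best solution ($J^* \leftarrow J$, $w^* \leftarrow w$, $\boldsymbol{h}^* \leftarrow \boldsymbol{h}_{S_h}$, $b^* \leftarrow b_{S_h}$) if $J < J^*$.
    end for
end for
return $w^*, \boldsymbol{h}^*, b^*$.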
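
As an illustration, Algorithm 1 can be prototyped as follows. This is a minimal sketch under stated assumptions, not a reference implementation: it uses the squared loss (so that each inner problem is a least squares fit), floating-point arithmetic instead of exact rationals, and hypothetical names (`least_squares`, `exact_binary_pwa`). Only one orientation per hyperplane is tested, since opposite labelings yield the same cost.

```python
from itertools import combinations, product
import numpy as np

def least_squares(Xbar, y):
    """Least squares fit on one mode; returns the parameters and the summed squared error."""
    if len(y) == 0:
        return np.zeros(Xbar.shape[1]), 0.0
    w = np.linalg.lstsq(Xbar, y, rcond=None)[0]
    return w, float(np.sum((y - Xbar @ w) ** 2))

def exact_binary_pwa(X, y):
    """Return (J*, W*, h*, b*) minimizing the squared error over the enumerated labelings."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    N, d = X.shape
    Xbar = np.hstack([X, np.ones((N, 1))])
    best = (np.inf, None, None, None)
    for idx in combinations(range(N), d):            # all subsets S_h of d points
        P = X[list(idx)]
        h = np.ones(1) if d == 1 else np.linalg.svd(P[1:] - P[0])[2][-1]
        b = -float(h @ P[0])
        side = X @ h + b > 0                          # S_1 (True) versus S_2 (False)
        for signs in product((True, False), repeat=d):
            mode1 = side.copy()
            mode1[list(idx)] = signs                  # classification of the points of S_h
            w1, e1 = least_squares(Xbar[mode1], y[mode1])
            w2, e2 = least_squares(Xbar[~mode1], y[~mode1])
            J = (e1 + e2) / N
            if J < best[0]:
                best = (J, np.vstack([w1, w2]), h, b)
    return best

X = [[-3], [-2], [-1], [1], [2], [3]]
J, W, h, b = exact_binary_pwa(X, [3, 2, 1, 1, 2, 3])   # noiseless samples of y = |x|
```

On this noiseless example the returned cost is zero up to rounding errors. The number of least squares fits grows as $2^{d+1}\binom{N}{d}$, so the sketch is only practical for small $d$.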


Proposition 4. Under Assumptions 1-2, Algorithm 1 exactly solves Problem 1 for $n = 2$ and any
fixed $d$ with a polynomial complexity in the order of $T(N)\, O(N^d)$.

Proof. By following a similar path as for Theorem 3, Algorithm 1 can be proved to test all linear
classifications of the data points up to symmetric ones. Since Algorithm 1 computes a solution in
terms of $w$ that is feasible for (8) for each of these classifications, the value of $J$ coincides with the
cost function of (8) for a particular $h$. By the symmetry of this cost function with respect to $h$
and the fact that it only depends on $h$ via its values at the data points, Algorithm 1 computes all
possible values of the cost function, including the exact global optimum of (8), and returns a global
minimizer. Thus, by Proposition 2, it also solves Problem 1. The total number of iterations of
Algorithm 1 is $2^d \binom{N}{d} = O(N^d)$ and, under Assumption 2, these iterations only involve operations
computed in polynomial time in the order of $T(N)$, hence the overall time complexity in the order
of $T(N)\, O(N^d)$. □


4.3 Multi-class extension 

For $n > 2$, the boundary between 2 modes $j$ and $k > j$ implemented by a linear classifier from $\mathcal{H}$ in
(1) is a hyperplane of equation $h_{jk}(x) = h_j(x) - h_k(x) = 0$, i.e., based on the difference of the two
functions $h_j(x) = \boldsymbol{h}_j^T x + b_j$ and $h_k(x) = \boldsymbol{h}_k^T x + b_k$. Based on these hyperplanes, the classification
rule can be written as

$$h(x) = \arg\max_{k \in \mathcal{Q}} h_k(x) = j, \quad \text{such that} \quad \begin{cases} h_{jk}(x) \ge 0, & \forall k > j, \\ h_{kj}(x) < 0, & \forall k < j. \end{cases}$$

Based on these facts, we can build an algorithm to recover all possible classifications consistent
with a linear classification in the sense of (1).

Theorem 4. For the set of multi-class linear classifiers $\mathcal{H}$ in (1), the growth function is
bounded for any $N > d$ by

$$\Pi_{\mathcal{H}}(N) \le \left[ 2^{d+1} \binom{N}{d} \right]^{n(n-1)/2} = O(N^{dn(n-1)/2})$$

and, for any set $S$ of $N$ points of $\mathbb{Q}^d$ in general position, an algorithm builds $\mathcal{H}_S$ in $O(N^{dn(n-1)/2})$
time.
Proof. Any classification produced by a classifier from (1) can be computed from the signs of the
$n_H = n(n-1)/2$ functions $h_{jk} = h_j - h_k$, $1 \le j < k \le n$, corresponding to the pairwise separating
hyperplanes. For any $S$, for each of these hyperplanes, Proposition 3 provides an equivalent binary
classifier which must be one from the $2\binom{N}{d}$ hyperplanes passing through $d$ points $S_{jk}$ of $S$. The
number of sets of $n_H$ such hyperplanes is $2^{n_H} \binom{N}{d}^{n_H}$. Since these classifiers cannot produce all the
$2^{n_H d}$ classifications of the $n_H d$ points in the sets $S_{jk}$, we must also take these into account so that
the number of classifications of $S$ is upper bounded by $|\mathcal{H}_S| \le 2^{n_H} 2^{n_H d} \binom{N}{d}^{n_H} = O(N^{d n_H})$. This
upper bound holds for any $S$, and thus also applies to the growth function. Finally, an algorithm
that makes explicit all the classifications mentioned above to build $\mathcal{H}_S$ can be constructed in a
recursive manner, with one classification per iteration and thus with a similar number of iterations,
each one including computations performed in constant time. □


Theorem 4 implies the following for PWA regression.

Corollary 1. Under Assumptions 1-2, a global solution to Problem 1 with $n \ge 3$ can be computed
with a polynomial complexity in the order of $T(N)\, O(N^{dn(n-1)/2})$.
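
To give a sense of scale, the short sketch below evaluates the bound of Theorems 3 and 4 on the number of classifications to test, which can be compared with the $n^N$ labelings of a naive enumeration. The function name `classification_bound` is hypothetical.

```python
from math import comb

def classification_bound(N, d, n=2):
    """Upper bound [2^(d+1) * C(N, d)]^(n(n-1)/2) on the number of linear classifications."""
    n_pairs = n * (n - 1) // 2          # number of pairwise separating hyperplanes
    return (2 ** (d + 1) * comb(N, d)) ** n_pairs

binary_bound = classification_bound(100, 2)        # n = 2 modes, d = 2, N = 100
naive_count = 2 ** 100                             # all binary labelings of 100 points
```

For $N = 100$ and $d = 2$, the binary bound is 39600, against roughly $1.3 \times 10^{30}$ labelings for the naive enumeration. The bound still grows quickly with $d$ and $n$, in line with the exponential complexity in the dimension and the number of modes.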


5 Conclusions 

The paper discussed complexity issues for PWA regression and showed that i) the global minimization
of the error is NP-hard in general, and ii) for fixed number of modes and data dimension, an
exact solution can be obtained in time polynomial in the number of data. The proof of NP-hardness 
also implies that the problem remains NP-hard even when the number of modes is fixed to 2, which 
indicates that the complexity is mostly due to the data dimension. An open issue concerns the 
conditions under which a PWA system generates trajectories satisfying the general position assumption
used by the polynomial-time algorithm. Future work will also focus on the extension of
the results to the case of arbitrarily switched systems and heuristics inspired by the polynomial-time
algorithm, whose practical application remains limited by an exponential complexity in the
dimension. 


Acknowledgements 

The author would like to thank the anonymous reviewers for their comments and suggestions. 
Thanks are also due to Yann Guermeur for carefully reading this manuscript. 


References 

[1] S. Paoletti, A. L. Juloski, G. Ferrari-Trecate, R. Vidal, Identification of hybrid systems: a 
tutorial, European Journal of Control 13 (2-3) (2007) 242-262. 

[2] A. Garulli, S. Paoletti, A. Vicino, A survey on switched and piecewise affine system identification,
in: Proc. of the 16th IFAC Symp. on System Identification (SYSID), 2012, pp.
344-355.

[3] R. Vidal, S. Soatto, Y. Ma, S. Sastry, An algebraic geometric approach to the identification
of a class of linear hybrid systems, in: Proc. of the 42nd IEEE Conf. on Decision and Control
(CDC), Maui, Hawaii, USA, 2003, pp. 167-172.

[4] L. Bako, Identification of switched linear systems via sparse optimization, Automatica 47 (4) 
(2011) 668-677. 

[5] I. Maruta, H. Ohlsson, Compression based identification of PWA systems, in: Proc. of the 
19th IFAC World Congress, Cape Town, South Africa, 2014, pp. 4985-4992. 




[6] L. Ljung, System identification: Theory for the User, 2nd Edition, Prentice Hall, 1999. 

[7] G. Ferrari-Trecate, M. Muselli, D. Liberati, M. Morari, A clustering technique for the identification
of piecewise affine systems, Automatica 39 (2) (2003) 205-217.

[8] A. Bemporad, A. Garulli, S. Paoletti, A. Vicino, A bounded-error approach to piecewise affine 
system identification, IEEE Transactions on Automatic Control 50 (10) (2005) 1567-1580. 

[9] A. L. Juloski, S. Weiland, W. Heemels, A Bayesian approach to identification of hybrid systems,
IEEE Transactions on Automatic Control 50 (10) (2005) 1520-1533.

[10] F. Lauer, G. Bloch, R. Vidal, A continuous optimization framework for hybrid system identification,
Automatica 47 (3) (2011) 608-613.

[11] F. Lauer, Estimating the probability of success of a simple algorithm for switched linear regression,
Nonlinear Analysis: Hybrid Systems 8 (2013) 31-47, supplementary material available
at http://www.loria.fr/~lauer/klinreg/.

[12] T. Pham Dinh, H. Le Thi, H. Le, F. Lauer, A difference of convex functions algorithm for 
switched linear regression, IEEE Transactions on Automatic Control 59 (8) (2014) 2277-2282. 

[13] N. Ozay, M. Sznaier, C. Lagoa, O. Camps, A sparsification approach to set membership 
identification of switched affine systems, IEEE Transactions on Automatic Control 57 (3) 
(2012) 634-648. 

[14] H. Ohlsson, L. Ljung, S. Boyd, Segmentation of ARX-models using sum-of-norms regularization,
Automatica 46 (6) (2010) 1107-1111.

[15] H. Ohlsson, L. Ljung, Identification of switched linear regression models using sum-of-norms 
regularization, Automatica 49 (4) (2013) 1045-1050. 

[16] F. Lauer, V. L. Le, G. Bloch, Learning smooth models of nonsmooth functions via convex 
optimization, in: Proc. of the IEEE Int. Workshop on Machine Learning for Signal Processing 
(MLSP), Santander, Spain, 2012. 

[17] J. Roll, A. Bemporad, L. Ljung, Identification of piecewise affine systems via mixed-integer 
programming, Automatica 40 (1) (2004) 37-50. 

[18] M. Garey, D. Johnson, Computers and Intractability: a Guide to the Theory of NP-Completeness,
W.H. Freeman and Company, 1979.

[19] D. Aloise, A. Deshpande, P. Hansen, P. Popat, NP-hardness of Euclidean sum-of-squares
clustering, Machine Learning 75 (2) (2009) 245-248.

[20] V. Blondel, J. Tsitsiklis, A survey of computational complexity results in systems and control, 
Automatica 36 (9) (2000) 1249-1274. 

[21] R. Alur, N. Singhania, Precise piecewise affine models from input-output data, in: Proc. of 
the 14th Int. Conf. on Embedded Software (EMSOFT), 2014. 

[22] G. H. Golub, C. F. Van Loan, Matrix Computations, 4th Edition, Johns Hopkins University
Press, 2013.

[23] V. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998. 




